By: Tim Kartawijaya (tak2151)
library(tidyverse)
options(scipen = 999) #turn off scientific notation like 9e+05
theme_set(theme_bw()) #set theme for all plots
Data: flower dataset in cluster package
First, we load in the data.
library(cluster)
flower_df = flower
#rename column names
names(flower_df) = c("winters","shadow","tubers","color","soil","preference","height","distance")
#relabel factor levels with descriptive names
levels(flower_df$winters) = c("No", "Yes")
levels(flower_df$shadow) = c("No", "Yes")
levels(flower_df$tubers) = c("No", "Yes")
levels(flower_df$color) = c("white","yellow","pink","red","blue")
levels(flower_df$soil) = c("dry","normal","wet")
#display full dataset
flower_df
color and soil variables, using best practices for the order of the bars.
Below we provide two bar graphs. The first treats the variable as nominal and sorts the bars by frequency, so the most frequent value is easy to see. The second treats the variable as ordinal and keeps the bars in their natural level order.
ggplot(data = flower_df, mapping = aes(x = soil)) +
geom_bar(color = "black", fill = "light blue") +
scale_x_discrete(limits = c("normal","wet","dry")) +
labs(x = "Soil", y = "Frequency", title = "Top 3 Soil Types of Popular Flowers")
ggplot(data = flower_df, mapping = aes(x = soil)) +
geom_bar(color = "black", fill = "light blue") +
labs(x = "Soil", y = "Frequency", title = "Soil Types of Popular Flowers")
ggplot(data = flower_df, mapping = aes(x = color)) +
geom_bar(color = "black", fill = "light blue") +
scale_x_discrete(limits = c("red","yellow","pink","blue","white")) +
labs(x = "Color", y = "Frequency", title = "Top 5 Colors of Popular Flowers")
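The hard-coded `limits` above have to be updated by hand whenever the counts change. As a minimal sketch, assuming the forcats package (part of the tidyverse) is installed, the same frequency ordering can be derived from the data with `fct_infreq()`. The names `flower_demo`, `soil_by_freq`, and `p_soil` are illustrative and not part of the original analysis.

```r
library(cluster)  # provides the flower data
library(ggplot2)
library(forcats)  # fct_infreq()

# Illustrative copy of the data with the soil column relabeled as above
flower_demo <- flower
names(flower_demo)[5] <- "soil"
levels(flower_demo$soil) <- c("dry", "normal", "wet")

# Reorder the levels from most to least frequent
flower_demo$soil_by_freq <- fct_infreq(flower_demo$soil)

# Bars now appear in descending frequency without hard-coded limits
p_soil <- ggplot(data = flower_demo, mapping = aes(x = soil_by_freq)) +
  geom_bar(color = "black", fill = "light blue") +
  labs(x = "Soil", y = "Frequency")
p_soil
```

This keeps the sorted plot in sync with the data, while the original `soil` column retains its ordinal order for the unsorted version.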
Data: MplsDemo dataset in carData package
First, we load in the data.
library(carData)
mp_df = MplsDemo
ggplot(data = mp_df, mapping = aes(x = hhIncome, y = reorder(neighborhood, hhIncome))) +
geom_point(color = "blue") +
labs(x = "Median Household Income ($)", y = "Neighborhood", title = "Median Household Income by Neighborhood")
#purple = college grad, black = in poverty, orange = foreign born (fixed colors draw no legend)
#the three variables are stored as fractions between 0 and 1, so the axis is labeled as a proportion
ggplot(data = mp_df, mapping = aes(y = reorder(neighborhood, collegeGrad))) +
geom_point(mapping = aes(x = collegeGrad), color = "purple") +
geom_point(mapping = aes(x = poverty), color = "black") +
geom_point(mapping = aes(x = foreignBorn), color = "dark orange") +
labs(x = "Proportion", y = "Neighborhood", title = "Foreign Born, In Poverty, and College Grad Proportion by Neighborhood")
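Because the three point layers above use fixed colors, ggplot2 draws no legend for them. One possible alternative, sketched here under the assumption that tidyr is available, is to reshape the three columns into long form and map color to the measure name. The names `mp_demo`, `mp_long`, `measure`, `proportion`, and `p_measures` are illustrative.

```r
library(carData)  # provides MplsDemo
library(tidyr)    # pivot_longer()
library(ggplot2)

# Fix the neighborhood order by college graduate proportion before reshaping
mp_demo <- MplsDemo
ordered_names <- unique(as.character(mp_demo$neighborhood[order(mp_demo$collegeGrad)]))
mp_demo$neighborhood <- factor(mp_demo$neighborhood, levels = ordered_names)

# One row per neighborhood and measure
mp_long <- pivot_longer(
  mp_demo,
  cols = c(collegeGrad, poverty, foreignBorn),
  names_to = "measure",
  values_to = "proportion"
)

# Mapping color to `measure` makes ggplot2 draw the legend
p_measures <- ggplot(data = mp_long,
                     mapping = aes(x = proportion, y = neighborhood, color = measure)) +
  geom_point() +
  scale_color_manual(values = c(collegeGrad = "purple",
                                poverty = "black",
                                foreignBorn = "darkorange")) +
  labs(x = "Proportion", y = "Neighborhood", color = "Measure")
p_measures
```

The colors match the ones used above, so the two versions can be compared directly.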
We can infer the following neighborhood patterns from the graphs above:
Neighborhoods with a low proportion of college graduates (< 25%) tend to have a high proportion of foreign-born residents (> 20%). Exceptions to this pattern include neighborhoods like Seward, Lyndale, and Downtown East.
Where few residents have college degrees (< ~25%), the poverty proportion is generally above the typical level (10%), with some exceptions such as Holland and Phillips West.
The higher the median household income, the higher the proportion of college graduates. The additional Cleveland dot plot below shows this more clearly: there is a positive, roughly linear relationship between the proportion of college graduates and median household income. Exceptions to this pattern are neighborhoods like Prospect Park and East River Road.
ggplot(data = mp_df, mapping = aes(y = reorder(neighborhood, collegeGrad))) +
geom_point(mapping = aes(x = hhIncome)) +
labs(x = "Median Household Income ($)", y = "Neighborhood", title = "Median Household Income by Neighborhood ordered by College Grad Proportion")
Data: NYC yellow cab rides in June 2018, available here:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
It’s a large file so work with a reasonably-sized random subset of the data.
First, let’s import the data from the raw csv file.
cab = read_csv("yellow_tripdata_2018-06.csv")
Next, let’s draw a random sample of 10,000 rows.
set.seed(12)
cab_sample = cab[sample(nrow(cab),10000),]
Draw four scatterplots of tip_amount vs. fare_amount with the following variations:
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) +
geom_point(alpha = 0.3, color = "black") +
labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount")
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) +
geom_point(alpha = 0.3, color = "black") +
labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") +
geom_density_2d()
The density contour plot is a little difficult to see here since there are a significant number of values clustered in the lower left corner of the plot.
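One way to make the contours easier to read is to zoom in on the dense corner with `coord_cartesian()`, which changes the visible window without dropping any rows. The sketch below uses simulated data as a stand-in for `cab_sample`; the data frame `cab_demo`, the simulated fare and tip distributions, and the axis limits are all assumptions chosen for illustration, not values from the real trip records.

```r
library(ggplot2)

set.seed(12)
n <- 2000

# Simulated stand-in for cab_sample: right-skewed fares, tips near 15% of the fare
cab_demo <- data.frame(fare_amount = rexp(n, rate = 1 / 12))
cab_demo$tip_amount <- pmax(0, 0.15 * cab_demo$fare_amount + rnorm(n, sd = 1))

p_zoom <- ggplot(data = cab_demo, mapping = aes(x = tip_amount, y = fare_amount)) +
  geom_point(alpha = 0.1, color = "black") +
  geom_density_2d() +
  # Zoom only the view; the density estimate still uses every point
  coord_cartesian(xlim = c(0, 6), ylim = c(0, 40)) +
  labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount (zoomed)")
p_zoom
```

Setting limits through `scale_x_continuous()` instead would remove the points outside the window before the density is estimated, which changes the contours.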
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) +
labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") +
geom_hex(binwidth = c(5,35))
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) +
labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") +
stat_bin2d(binwidth = c(2,16))
For all, adjust parameters to the levels that provide the best views of the data.
There are many insights that can be derived from the plots above:
Data: olives dataset in extracat package
library(extracat)
olives_df = olives
From the plot below, the pairs of variables that have a strongly positive association are:
Those with a weaker positive association:
Conversely, the pairs of variables that have a strongly negative association are:
pairs(olives_df[,3:10])
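Visual impressions from a scatterplot matrix can be cross-checked against the pairwise correlations. The following is a minimal sketch of a hypothetical helper, `rank_pairs()`, that lists every pair of numeric columns sorted by Pearson correlation. It is demonstrated on the built-in `iris` data purely as a stand-in; applying it to `olives_df[,3:10]` would assume those columns are all numeric.

```r
# Hypothetical helper: list all variable pairs, sorted from most positive
# to most negative Pearson correlation
rank_pairs <- function(df) {
  cm <- cor(df)
  # Each unordered pair appears once in the upper triangle
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx]
  )
  out[order(-out$r), ]
}

# Stand-in example on the four numeric iris columns
ranked <- rank_pairs(iris[, 1:4])
head(ranked, 3)
```

Pairs at the top of the result correspond to strongly positive associations and pairs at the bottom to strongly negative ones. Pearson correlation only captures linear association, so the scatterplots remain the primary evidence.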
From the colored scatterplot matrix below, we can observe several things:
pairs(olives_df[,3:10], col = olives_df$Region)
#create legend; pairs() colors points by factor code, so the fill colors follow the level order
par(xpd = TRUE)
legend(-0.05, 1.05, fill = seq_along(levels(olives_df$Region)), legend = levels(olives_df$Region))
Data: wine dataset in pgmm package
(Recode the Type variable to descriptive names.)
Type. Present the version that you find to be most informative. You do not need to include all of the variables.
First, we load the data.
library(pgmm)
data(wine)
Then, we create the parallel coordinate plot using variables for which the three types had significantly differentiated values. In this case, the variables used are:
library(GGally)
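#recode Type to descriptive names (1 = Barolo, 2 = Grignolino, 3 = Barbera, per the pgmm documentation)
wine$Type = factor(wine$Type, levels = 1:3, labels = c("Barolo", "Grignolino", "Barbera"))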
ggparcoord(wine, columns = c(2,16,17,20,27), groupColumn = 'Type', scale = 'std', alpha = 0.55)